My Iranian churn dataset was missing, so I decided to choose the Seoul Bike Sharing Demand dataset found on the UC Irvine Machine Learning Repository.
Abstract
The dataset contains the count of public bikes rented at each hour in the Seoul Bike Sharing System, with the corresponding weather data and holiday information.
Data Set Information
Currently, rental bikes are introduced in many urban cities to enhance mobility comfort. It is important to make rental bikes available and accessible to the public at the right time, as this lessens waiting time. Eventually, providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to ensure a stable supply. The dataset contains weather information (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), the number of bikes rented per hour, and date information.
The objective of this notebook is to analyze the dataset I was assigned and to build a predictive model from there.
Here we import all the libraries we need for the notebook.
import json
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
plt.rcParams['figure.dpi'] = 200 # default = 72.0
plt.rcParams['font.size'] = 7.5 # default = 10.0
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.utils import all_estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ARDRegression
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")
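Before touching the real data, here is a minimal sketch of the regression workflow these imports enable, run on synthetic numbers (the features and coefficients are entirely hypothetical, only to illustrate the fit/predict/score loop):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 fake features and a noisy linear target
rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 3))
y = 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Hold out 20% of the samples to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f'R2 on held-out data: {r2_score(y_test, model.predict(X_test)):.2f}')
```

The same split/fit/score pattern applies to any of the regressors imported above.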
First, I can import the dataset and show some lines to see what the features are and their format.
df = pd.read_csv("SeoulBikeData.csv")
df
We can clearly see that our dataframe deserves some small changes to be more suitable to work with.
shorter_column_name = [
    'date',    # Date
    'n_bike',  # Rented Bike Count
    'hour',    # Hour
    'temp',    # Temperature (°C)
    'hum',     # Humidity (%)
    'wind',    # Wind speed (m/s)
    'visb',    # Visibility (10m)
    'dew',     # Dew point temperature (°C)
    'solar',   # Solar Radiation (MJ/m2)
    'rain',    # Rainfall (mm)
    'snow',    # Snowfall (cm)
    'season',  # Seasons
    'holiday', # Holiday
    'func'     # Functioning Day
]
df.columns = shorter_column_name
df['func'] = (df['func'] == "Yes").astype(int)
df['holiday'] = (df['holiday'] == "Holiday").astype(int)
df['date'] = pd.to_datetime(df['date'], format="%d/%m/%Y")
df
Then, as we have only one categorical variable, season (for now), we can display a short description of the numerical features.
We will go deeper into these descriptions during the analysis.
df.describe()
Let's start our analysis by dealing with missing values. To be sure the dataset is clean, I compute the number of missing values for each column over the entire dataset.
df.isnull().sum().reset_index(name='NA_count').sort_values(by='NA_count', ascending=False)
Surprisingly, there are no missing values (a rare event). The dataset is very clean and homogeneous.
Before diving into the analysis, let's explore the number of bikes rented per hour according to whether or not it is a functioning day.
df.groupby(['func']).agg({'n_bike': ['mean', 'std', 'count']}).reset_index()
We can see that the number of rented bikes is 0 for hours on non-functioning days. In other words, we cannot train a model on the features of the non-functioning days to predict the number of bikes rented per hour. So I remove these samples from the dataset.
N = len(df)
print(f'Dataset number of samples before non-functioning days deletion: {len(df)}')
print(f'Number of non-functioning day samples: {len(df[df.func != 1])}')
df = df[df.func == 1].drop('func', axis=1)
print(f'Dataset number of samples after non-functioning days deletion: {len(df)}')
As I said before, the target will be the number of bikes rented. More precisely, each sample gives the number of bikes rented for each hour from 2017-12-01 to 2018-11-30. We can conclude it is a regression task: predict the number of bikes rented for a given hour based on the temporal and meteorological features the dataset contains.
For these hours, I want to see the distribution of the number of rented bikes.
N = len(df) // 100
plt.hist(x=df.n_bike, bins=N, color='#8800ff', alpha=0.5)
plt.title('Distribution of rented bike count per hour on functioning days')
plt.xlabel('n_bike')
plt.ylabel('count')
pass
Transformation reminder
Square Root
The square root method is typically used when your data is moderately skewed. The square root is a transformation with a moderate effect on distribution shape, generally used to reduce right-skewed data. Finally, the square root can be applied to zero values and is most commonly used on count data.
Logarithmic
The logarithm is a strong transformation that has a major effect on distribution shape. Like the square root, this technique is often used to reduce right skewness. However, it cannot be applied to zero or negative values.
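The effect of each transformation on skewness can also be checked numerically. A minimal sketch with scipy.stats.skew on simulated count data (a hypothetical stand-in for a right-skewed variable, not the bike counts themselves):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Squaring a Poisson sample gives strongly right-skewed, count-like data
counts = rng.poisson(lam=3, size=10_000) ** 2

print(f'raw skewness:      {skew(counts):.2f}')
print(f'sqrt skewness:     {skew(np.sqrt(counts)):.2f}')
# log1p = log(x + 1), which sidesteps the zero-value restriction of the log
print(f'log(x+1) skewness: {skew(np.log1p(counts)):.2f}')
```

The sqrt shrinks the right tail moderately; the log acts much more strongly and can even overshoot into left skewness.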
Here we can see the distribution is right skewed. A square root or log transformation is needed to normalize this distribution. I will choose the square root transformation because I don't want too strong an effect on the values from approximately 500 to 1200; on this range we see two Gaussian-like subdistributions.
(For the example, I will plot both the sqrt- and log-transformed distributions.)
fig, axes = plt.subplots(1, 2)
# sqrt transformation
axes[0].hist(x=np.sqrt(df.n_bike), bins=N, color='#2f9599', alpha=0.5)
axes[0].set_title('Distribution of rented bike count per hour on functioning days')
axes[0].set_xlabel('sqrt(n_bike)')
axes[0].set_ylabel('count')
# log transformation
axes[1].hist(x=np.log(df.n_bike), bins=N, color='#999999', alpha=0.5)
axes[1].set_title('Distribution of rented bike count per hour on functioning days')
axes[1].set_xlabel('log(n_bike)')
axes[1].set_ylabel('count')
pass
We can conclude the sqrt is the most suitable transform to give the n_bike distribution a more Gaussian-like shape. Moreover, we can clearly see that the two Gaussian-like subdistributions I spoke about before lead to a left-skewed distribution with the log transform.
df['n_bike'] = np.sqrt(df.n_bike)
The second interesting part is that we have a kind of time-series dataset, and the temptation is high to treat it that way. But even though this dataset is about bike rentals over time, I am not convinced this is a time-series question. Let's plot the number of bikes rented for each hour over time.
df.plot(x='date', y='n_bike', color='#2f9599', alpha=0.05, kind='scatter', label='hour of rent')
plt.plot(df.date, df.n_bike.rolling(7*24, center=True).mean(), color='#8800ff', alpha=0.5, label='Average bike rented on centered week')
plt.plot(df.date, df.n_bike.rolling(30*24, center=True).mean(), color='#8800ff', label='Average bike rented on centered month')
plt.legend()
plt.title('Distribution of rented bike count per hour over time')
plt.xlabel('day')
plt.ylabel('sqrt(n_bike)')
pass
As I said, even looking at this plot suggests the idea of a time series. But if we look at the plot more precisely and with knowledge of the data, we can see the number of rented bikes is not really a time-series problem.
Knowing what a bike rental is, we can reasonably suppose the number of rentals is not driven by the previous rentals (even if a trend can be seen, only one year of data is not sufficient to extrapolate this trend). The number of rentals is driven by weekday habits and the seasons; that's why our meteorological features will be helpful.
First, we can extract more information from the date and hour to confirm my supposition. I extract the year, the month, the day number, the weekday name, and a working-day flag:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month_name()
df['day'] = df['date'].dt.day
df['week_day'] = df['date'].dt.day_name()
df['working_day'] = (df['date'].dt.dayofweek < 5).astype(int)
df = df.drop('date', axis=1)
time_cols = ['year', 'month', 'day', 'week_day', 'working_day', 'season', 'holiday']
df[time_cols + ['n_bike']]
My objective through the following analysis is to show whether or not the time features are good categorical features (rather than numerical ones for a time series).
First, I have a doubt about the ability of the day number within the month to give information on the number of rented bikes.
plt.scatter(df.day, df.n_bike, color='#2f9599', alpha=0.05, label='hour of rent')
_temp = df.groupby('day').agg({'n_bike': ['mean', 'std']}).reset_index()
mean = _temp.n_bike['mean']
std = _temp.n_bike['std']
plt.plot(_temp.day, mean, color='#8800ff', label='mean')
plt.fill_between(_temp.day, mean - std, mean + std, color='#8800ff', alpha=0.1, label='1-std interval')
plt.legend()
plt.title('Distribution of rented bike count on day number')
plt.xlabel('day')
plt.ylabel('sqrt(n_bike)')
pass
To be honest, I was hoping there would be a trend to rent more bikes at the beginning of the month, when the salary is recent. But no. Being at the beginning, the middle, or the end of the month seems to have almost zero influence on our target.
Also, the year is not usable, as the time range is 2017-12-01 to 2018-11-30, an exactly one-year window.
And what about the months and seasons?
In this part, I am interested in seeing whether the month (categorical) is a good feature to predict the number of bikes rented in a given hour of a given day. To do so, I display some aggregates for each month and plot the notched box plot of the distribution of the number of bikes rented.
# group by month to get box plot
groups = df.groupby(by='month', sort=False).n_bike
labels = groups.count().reset_index(name='count').month.values
groups = [groups.get_group(k) for k in groups.groups]
# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i % len(colors)])
plt.title('Distribution of rented bike count according to the month')
plt.xlabel('month')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('month', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Here we clearly see the months have an impact on the number of bikes rented. Moreover, we can suppose the season will be a good summary of this plot. Indeed, we see fewer bikes rented during the winter months, and inversely.
In this part, I will show the same things I did for the month, to see if the seasons are good features and a good summary of the month result above.
# group by season to get box plot
groups = df.groupby(by=['season'], sort=False).n_bike
labels = groups.count().reset_index(name='count').season.values
groups = [groups.get_group(k) for k in groups.groups]
# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i % len(colors)])
plt.title('Distribution of rented bike count according to the season')
plt.xlabel('season')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('season', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
As I said, during the winter months fewer bikes are rented.
In this part, I will do the same thing as for the month and season features, but on the holiday one.
# group by holiday to get box plot
groups = df.groupby(by=['holiday'], sort=False).n_bike
labels = ['No Holiday', 'Holiday']
groups = [groups.get_group(k) for k in groups.groups]
# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i % len(colors)])
plt.title('Distribution of rented bike count according to the holiday')
plt.xlabel('holiday')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
df.groupby('holiday', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
It is not as clear as for the seasons or months, but there is a trend toward fewer bikes rented during holidays. We can almost conclude these bikes are rented to commute to work.
If my previous conclusion is valid, then by plotting the mean number of rented bikes according to the hour, we should see two peaks around 8:00 and 18:00.
_temp = df.groupby('hour').agg({'n_bike': ['mean', 'std']}).reset_index()
mean = _temp.n_bike['mean']
std = _temp.n_bike['std']
plt.plot(_temp.hour, mean, color='#8800ff', label='mean')
plt.fill_between(_temp.hour, mean - std, mean + std, color='#8800ff', alpha=0.1, label='1-std interval')
plt.legend()
plt.title('Distribution of rented bike count according to day hours')
plt.xlabel('hour')
plt.ylabel('sqrt(n_bike)')
pass
As I supposed, we can see these peaks. So I can make a better supposition: this distribution should be sharper for working days (Monday to Friday in Korea), and we can probably see another distribution for the weekend days.
For this part, I will plot the distribution of the number of rented bikes according to the hour, splitting the curves by weekday on one panel and by working day on the other.
fig, axes = plt.subplots(1, 2)
# Weekdays plot
_temp = df.groupby(['week_day', 'hour']).agg({'n_bike': ['mean', 'std']}).reset_index()
colors = ['#80f', '#80f8', '#2f9599', '#2f959988', '#999', '#ccc', '#aaa']
for week_day, color in zip(_temp.week_day.unique(), colors):
    mean = _temp.n_bike[_temp.week_day == week_day]['mean']
    axes[0].plot(_temp[_temp.week_day == week_day].hour, mean, color=color, label=f'{week_day} - mean')
axes[0].legend()
axes[0].set_title('Distribution of rented bike count according to day hours')
axes[0].set_xlabel('hour')
axes[0].set_ylabel('sqrt(n_bike)')
# working days plot
_temp = df.groupby(['working_day', 'hour']).agg({'n_bike': ['mean', 'std']}).reset_index()
colors = ['#8800ff', '#2f9599']
for working_day, color in zip(_temp.working_day.unique(), colors):
    mean = _temp.n_bike[_temp.working_day == working_day]['mean']
    std = _temp.n_bike[_temp.working_day == working_day]['std']
    axes[1].plot(_temp[_temp.working_day == working_day].hour, mean, color=color, label=f'{"Is" if working_day else "Is not"} working day - mean')
    axes[1].fill_between(_temp[_temp.working_day == working_day].hour, mean - std, mean + std, color=color, alpha=0.05, label=f'{"Is" if working_day else "Is not"} working day - 1-std interval')
axes[1].legend()
axes[1].set_title('Distribution of rented bike count according to day hours')
axes[1].set_xlabel('hour')
axes[1].set_ylabel('sqrt(n_bike)')
pass
Bingo! The distributions clearly differ between working days and non-working days (and between weekdays and weekend days).
To conclude: I was right in saying the time-series idea is not as good as one could suppose at first. The months, seasons, holidays, hours, etc., as categorical features, are far better than a continuous time.
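One common way to turn such categorical time features into model inputs (a sketch on a hypothetical three-row frame, not necessarily the encoding used later in this notebook) is one-hot encoding with pd.get_dummies:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the categorical time features
sample = pd.DataFrame({
    'season': ['Winter', 'Spring', 'Winter'],
    'week_day': ['Monday', 'Saturday', 'Sunday'],
    'hour': [8, 18, 23],
})

# Each category becomes its own 0/1 column, so no artificial ordering
# is imposed; 'hour' is left numeric here but could be encoded as well
encoded = pd.get_dummies(sample, columns=['season', 'week_day'])
print(encoded.columns.tolist())
```

This avoids a regressor treating, say, December and January as numerically far apart.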
In this part, I will explain the other features. Only the meteorological ones remain.
The dew point (°C) or dew temperature (°C) is the temperature below which condensation naturally occurs. More technically, below this temperature which depends on the ambient pressure and humidity, the water vapor contained in the air condenses on surfaces, by saturation effect.
Concretely, the higher the dew point and the temperature, the more comfortable the bike rider will feel (assuming natural Korean temperatures).
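To make the definition concrete, the dew point can be approximated from temperature and relative humidity with the Magnus formula (a standard meteorological approximation; this helper is purely illustrative and is not derived from the dataset):

```python
import numpy as np

def dew_point(temp_c, rel_hum_pct):
    """Approximate dew point (°C) from air temperature (°C) and relative
    humidity (%) via the Magnus formula; b and c are the usual
    Magnus-Tetens constants, reasonable between roughly -40°C and 50°C."""
    b, c = 17.62, 243.12
    gamma = np.log(rel_hum_pct / 100) + b * temp_c / (c + temp_c)
    return c * gamma / (b - gamma)

print(round(dew_point(20, 60), 1))   # around 12°C
print(round(dew_point(20, 100), 1))  # at saturation, dew point equals temperature
```

Since at fixed humidity the dew point rises with the temperature, a strong correlation between the two features is expected.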
I will deal with these two features simultaneously, as we can assume these temperatures are correlated. Of course, I will plot a first scatter plot to validate this assumption. Then I will plot the temperature distributions to see if they need some transformation. The last plot shows the relation between the temperatures and the number of bikes rented (already sqrt-transformed).
plt.rcParams['figure.figsize'] = [12, 10] # default = [6.0, 4.0]
N = len(df) // 100
# tempature and dew point comparison
ax = plt.subplot2grid((2, 2), (0, 0), colspan=2)
df.plot(x='temp', y='dew', kind='scatter', color='#8800ff', alpha=0.05, ax=ax, label='Given hour')
# add regression line
x, y = df.temp, df.dew
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Temperature and Dew point temperature comparison')
ax.set_xlabel('Temperature')
ax.set_ylabel('Dew point temperature')
ax.legend()
# Temperatures distributions
ax = plt.subplot2grid((2, 2), (1, 0), colspan=1)
ax.hist(x=df.temp, bins=N, color='#8800ff', alpha=0.3, label='Temperature')
ax.hist(x=df.dew, bins=N, color='#2f9599', alpha=0.3, label='Dew point')
# add legends
ax.set_title('Distribution of the temperatures')
ax.set_xlabel('Temperatures')
ax.set_ylabel('Count')
ax.legend()
# Temperatures and rented bikes distribution
ax = plt.subplot2grid((2, 2), (1, 1), colspan=1)
df.plot(x='temp', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Temperature at a given hour', ax=ax)
df.plot(x='dew', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Dew point at a given hour', ax=ax)
# add regression lines
x, y = df.temp, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression of Temperature')
x, y = df.dew, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression of Dew point')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the temperatures')
ax.set_xlabel('Temperatures')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
pass
The two temperatures seem correlated, so it is a good point that I can compare them. Moreover, their distributions are near-Gaussian, so I leave these two features without transformation. Finally, we can see the concrete argument I made before: the higher the dew point and the temperature, the more comfortable the bike rider will feel. It leads to more bikes rented.
Relative air humidity (%) is a measure of the ratio between the water vapor content of the air and its maximum capacity to contain water vapor under these conditions.
The relative humidity of the ambient air influences the evaporation of sweat, and thus the cooling of the body. Too low a humidity level increases cooling and increases the efficiency of perspiration, while too high a humidity level limits cooling and thus increases the feeling of warmth.
I will plot the humidity distribution to see if it needs some transformation. The other plot shows the relation between the humidity and the number of bikes rented (already sqrt-transformed).
N = len(df) // 100
# Humidity distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.hum, bins=N, color='#8800ff', alpha=0.3, label='Humidity')
# add legends
ax.set_title('Distribution of the Humidity')
ax.set_xlabel('hum')
ax.set_ylabel('Count')
ax.legend()
# Humidity and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='hum', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Humidity at a given hour', ax=ax)
# add regression line
x, y = df.hum, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Humidity')
ax.set_xlabel('hum')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
The humidity distribution is near-Gaussian (except near 100%), so I also leave it without transformation. And we can see the concrete argument I made before: the lower the humidity, the more comfortable the bike rider will feel. It leads to more bikes rented.
For the wind speed (m/s), there is nothing special to say, except that there is no strong wind, which is considered to start from 14 m/s. In our dataset the max is only 7.4 m/s, so I am not sure this feature will have a strong influence.
Still, let's plot the wind speed distribution to see if it needs some transformation. The other plot shows the relation between the wind speed and the number of bikes rented (already sqrt-transformed).
N = len(df) // 100
# Wind speed distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.wind, bins=N, color='#8800ff', alpha=0.3, label='Wind speed')
# add legends
ax.set_title('Distribution of the Wind speed')
ax.set_xlabel('wind')
ax.set_ylabel('Count')
ax.legend()
# Wind speed and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='wind', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Wind speed at a given hour', ax=ax)
# add regression line
x, y = df.wind, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Wind speed')
ax.set_xlabel('wind')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
I expected to see that the wind speed would lead to fewer bikes rented, but it is the opposite. I am surprised by this result, but I can suppose it does not depend only on the wind but on crossed features, or maybe the wind can be comfortable for the rider to some degree. I also see that the wind speed has a right-skewed distribution, so I will apply a log transform to this feature.
df['wind'] = np.log(df['wind'] + 1)
N = len(df) // 200
# Wind speed distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.wind, bins=N, color='#2f9599', alpha=0.3, label='Wind speed')
# add legends
ax.set_title('Distribution of the Wind speed')
ax.set_xlabel('log(wind + 1)')
ax.set_ylabel('Count')
ax.legend()
# Wind speed and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='wind', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Wind speed at a given hour', ax=ax)
# add regression line
x, y = df.wind, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Wind speed')
ax.set_xlabel('log(wind + 1)')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
As we know, the solar radiation (MJ/m2) corresponds to a flow of radiation from the sun. In other words, the more radiation there is, the more we are exposed to the sun.
I plot the solar radiation distribution to see if it needs some transformation. The other plot shows the relation between the solar radiation and the number of bikes rented (already sqrt-transformed).
N = len(df) // 100
# Solar radiation distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.solar, bins=N, color='#8800ff', alpha=0.3, label='Solar radiation')
# add legends
ax.set_title('Distribution of the Solar radiation')
ax.set_xlabel('solar')
ax.set_ylabel('Count')
ax.legend()
# Solar radiation and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='solar', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Solar radiation at a given hour', ax=ax)
# add regression line
x, y = df.solar, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Solar radiation')
ax.set_xlabel('solar')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
Here, the distribution is clearly right skewed, with a large mass at zero (the night hours). So I will apply a square root transformation for three reasons: it has a moderate effect on the shape, it reduces right skewness, and it can be applied to zero values.
df['solar'] = np.sqrt(df['solar'])
N = len(df) // 100
# Solar radiation distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.solar, bins=N, color='#2f9599', alpha=0.3, label='Solar radiation')
# add legends
ax.set_title('Distribution of the Solar radiation')
ax.set_xlabel('sqrt(solar)')
ax.set_ylabel('Count')
ax.legend()
# Solar radiation and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='solar', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Solar radiation at a given hour', ax=ax)
# add regression line
x, y = df[df.solar>0.05].solar, df[df.solar>0.05].n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression (sqrt(solar) > 0.05)')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Solar radiation')
ax.set_xlabel('sqrt(solar)')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
The rainfall (mm) corresponds to the rain fallen during the current hour. The more rain there is, the riskier riding is.
The rainfall is given for each hour. Estimating that rain has an effect on dryness for 2 hours, I will apply a moving average of the rainfall over the last 2 hours.
As I did for the other meteorological features, I plot the rainfall distribution to see if it needs some transformation. The other plot shows the relation between the rainfall and the number of bikes rented (already sqrt-transformed).
df['rain'] = df.rolling(2, min_periods=1)['rain'].mean()
N = len(df) // 100
# Rainfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.rain, bins=N, color='#8800ff', alpha=0.3, label='Rainfall')
# add legends
ax.set_title('Distribution of the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('Count')
ax.legend()
# Rainfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='rain', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Rainfall at a given hour', ax=ax)
# add regression line
x, y = df.rain, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
The plot with the regression line clearly indicates that the rainier it is, the less the bikes are used. But I can't clearly see the distribution when it rains. Let's filter our plots on rainy hours.
_df = df.copy()
_df = _df[_df.rain > 0]
N = len(_df) // 100
# Rainfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=_df.rain, bins=N, color='#8800ff', alpha=0.3, label='Rainfall')
# add legends
ax.set_title('Distribution of the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('Count')
ax.legend()
# Rainfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
_df.plot(x='rain', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Rainfall at a given hour', ax=ax)
# add regression line
x, y = _df.rain, _df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Rainfall')
ax.set_xlabel('rain')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
Here we can see the rainy hours are really right skewed. Moreover, we clearly see the inverse proportionality with the number of rented bikes. So we will try another feature, dryness (1/mm), given by $$dryness = \frac{1}{rain + 1}$$ with the +1 avoiding a division by zero.
df['dryness'] = 1 / (df.rain + 1)
df = df.drop('rain', axis=1)
N = len(df[df.dryness < 1]) // 100
# Dryness distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df[df.dryness < 1].dryness, bins=N, color='#2f9599', alpha=0.3, label='Dryness')
# add legends
ax.set_title('Distribution of the Dryness (dryness < 1)')
ax.set_xlabel('dryness')
ax.set_ylabel('Count')
ax.legend()
# Dryness and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='dryness', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='Dryness at a given hour', ax=ax)
# add regression line
x, y = df[df.dryness < 1].dryness, df[df.dryness < 1].n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression (dryness < 1)')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the dryness')
ax.set_xlabel('dryness')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
It is a really interpretable transformation, so I will keep it in place of the rain feature.
The snowfall (cm) corresponds to the snow fallen during the current hour. The more snow there is, the riskier riding is.
The snowfall is given for each hour. Estimating that snow has an effect on dryness for 8 hours in the city of Seoul, I will apply a moving average of the snowfall over the last 8 hours.
I plot the snowfall distribution to see if it needs some transformation. The other plot shows the relation between the snowfall and the number of bikes rented (already sqrt-transformed).
df['snow'] = df.rolling(8, min_periods=1)['snow'].mean()
N = len(df) // 100
# Snowfall distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.snow, bins=N, color='#8800ff', alpha=0.3, label='Snowfall')
# add legends
ax.set_title('Distribution of the Snowfall')
ax.set_xlabel('snow')
ax.set_ylabel('Count')
ax.legend()
# Snowfall and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='snow', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Snowfall at a given hour', ax=ax)
# add regression line
x, y = df.snow, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Snowfall')
ax.set_xlabel('snow')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
To be honest, I thought the rainfall and the snowfall would behave the same. Of course, at first glance, we can conclude that the more it snows, the fewer bikes are rented. Yet looking more precisely, we see it behaves more like a categorical variable: whether there is 1 or 10 cm of snow, the number of bikes rented seems to follow the same distribution. In other words, snow is disqualifying in itself. So I choose to transform this variable into a boolean one and see the results on a notched box plot.
_df = df.copy()
_df['snowing'] = (_df.snow > 0).astype(int)
# group by snowing to get box plot
groups = _df.groupby(by=['snowing'], sort=False).n_bike
labels = ['Not snowing', 'Snowing']  # groups appear in order of first occurrence (non-snowing hours first)
groups = [groups.get_group(k) for k in groups.groups]
# custom colors
gray_dict = dict(color='#99999988', markeredgecolor='#999999')
colors = ['#8800ff88', '#2f959988']
boxplot = plt.boxplot(
    groups, notch=True, vert=True, patch_artist=True, labels=labels,
    capprops=gray_dict, whiskerprops=gray_dict, flierprops=gray_dict, medianprops=gray_dict
)
# change box colors
for i, patch in enumerate(boxplot['boxes']):
    patch.set_color(colors[i % len(colors)])
plt.title('Distribution of rented bike count according to the snowing')
plt.xlabel('snowing')
plt.ylabel('sqrt(n_bike)')
# display multiple aggregates and remove the temporary columns
_df.groupby('snowing', sort=False).agg({'n_bike': ['count', 'sum', 'mean', 'min', 'max', 'median', 'std']}).apply(lambda row: [f'{x:.1f}' for x in row])
Indeed, we can see the distribution of the number of bikes rented is restricted when it snows. So snowing is a good categorical feature for snowy hours.
df['snowing'] = (df.snow > 0).astype(int)
df = df.drop('snow', axis=1)
Only the visibility (10m) feature remains; it corresponds to the visibility at the horizon for the human eye (default 2000).
I plot the visibility distribution to see if it needs some transformation. The other plot shows the relation between the visibility and the number of bikes rented (already sqrt-transformed).
N = len(df) // 100
# Visibility distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.visb, bins=N, color='#8800ff', alpha=0.3, label='Visibility')
# add legends
ax.set_title('Distribution of the Visibility')
ax.set_xlabel('visb')
ax.set_ylabel('Count')
ax.legend()
# Visibility and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='visb', y='n_bike', kind='scatter', color='#8800ff', alpha=0.05, label='Visibility at a given hour', ax=ax)
# add regression line
x, y = df.visb, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#8800ff', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the Visibility')
ax.set_xlabel('visb')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
As we can see, visb defaults to 2000, which is awkward to interpret as a maximum visibility. So I decide to replace the visibility feature with invisb, which reverses the distribution. It is interpreted as the removed (reduced) visibility (10m).
df['invisb'] = df.visb.max() - df.visb
df = df.drop('visb', axis=1)
N = len(df) // 100
# reduced visibility distributions
ax = plt.subplot2grid((1, 2), (0, 0), colspan=1)
ax.hist(x=df.invisb, bins=N, color='#2f9599', alpha=0.3, label='reduced visibility')
# add legends
ax.set_title('Distribution of the reduced visibility')
ax.set_xlabel('invisb')
ax.set_ylabel('Count')
ax.legend()
# reduced visibility and rented bikes distribution
ax = plt.subplot2grid((1, 2), (0, 1), colspan=1)
df.plot(x='invisb', y='n_bike', kind='scatter', color='#2f9599', alpha=0.05, label='reduced visibility at a given hour', ax=ax)
# add regression line
x, y = df.invisb, df.n_bike
m, b = np.polyfit(x, y, 1)
ax.plot(x, m * x + b, color='#2f9599', label='Linear regression')
# add legends
ax.set_title('Distribution of the n_bike (transformed) according to the reduced visibility')
ax.set_xlabel('invisb')
ax.set_ylabel('sqrt(n_bike)')
ax.legend()
pass
Now it is easier to interpret: the more the visibility is reduced, the fewer bikes are rented.
Before completing the preprocessing of the dataset, I move this data from df to data, and we can take a quick look at it as a reminder.
data = df
data
Let's remember that holiday, working_day and snowing are already one-hot encoded, as they are boolean features. Only the following remain:
Important note: I save every dummy in a dummies.json file.
categorical_columns = ['hour', 'season', 'month', 'day', 'week_day']
dummies = {}
for column in categorical_columns:
    unique = [str(x) for x in list(data[column].unique())]
    dummies[column] = unique
# save file
with open('dummies.json', 'w') as f:
    json.dump(dummies, f, indent=4)
def dummies_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
        dummies = json.load(f)
    for column, values in dummies.items():
        for value in values:
            _df[f'{column}_{value}'] = (_df[column].astype(str) == value).astype(int)
        _df = _df.drop(column, axis=1)
    return _df
data = dummies_from_file(data, 'dummies.json')
data
Now all our categorical features are ready to use. Let's plot the histograms and describe the numerical ones as a reminder.
N = len(data) // 100
numerical_columns = ['n_bike', 'temp', 'hum', 'wind', 'dew', 'solar', 'dryness']
plt.rcParams['figure.figsize'] = [12, 6] # default = [6.0, 4.0]
fig, axes = plt.subplots(2, 4)
axes = axes.flatten()
axes[-1].remove()
for column, ax in zip(numerical_columns, axes):
    ax.hist(x=data[column], bins=N, color='#8800ff', alpha=0.3)
    # add legends
    ax.set_title(f'Distribution of {column}')
plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
data[numerical_columns].describe()
The humidity and the dryness lie in fixed intervals ([0, 100] and [0, 1]), so I will normalize these features. I will also normalize the solar feature, as I want to keep the 0 value unchanged.
That leaves the number of bikes, the temperature, the dew point and the wind, which I will standardize. A particular point for n_bike: I put the mean and std in variables so I can unfit the transformation later on predictions. I save the transformations applied in a transformations.json file.
MEAN_LOG_N_BIKE = data.n_bike.mean()
STD_LOG_N_BIKE = data.n_bike.std()
print(f'MEAN_LOG_N_BIKE: {MEAN_LOG_N_BIKE:.2f}')
print(f'STD_LOG_N_BIKE: {STD_LOG_N_BIKE:.2f}')
transformations = {}
# standardization
for column in ['n_bike', 'temp', 'wind', 'dew']:
    mean, std = data[column].mean(), data[column].std()
    transformations[column] = {'function': 'standardization', 'mean': float(mean), 'std': float(std)}
# Normalization
for column in ['hum', 'solar', 'dryness']:
    maxi = data[column].max()
    transformations[column] = {'function': 'normalization', 'maxi': float(maxi)}
# save file
with open('transformations.json', 'w') as f:
    json.dump(transformations, f, indent=4)
def norm_from_file(df, file_name):
    _df = df.copy()
    with open(file_name) as f:
        transformations = json.load(f)
    for column, t in transformations.items():
        if t['function'] == 'standardization':
            _df[column] = (_df[column] - t['mean']) / t['std']
        elif t['function'] == 'normalization':
            _df[column] = _df[column] / t['maxi']
    return _df
data = norm_from_file(data, 'transformations.json')
data
N = len(data) // 100
numerical_columns = ['n_bike', 'temp', 'hum', 'wind', 'dew', 'solar', 'dryness']
plt.rcParams['figure.figsize'] = [12, 6] # default = [6.0, 4.0]
fig, axes = plt.subplots(2, 4)
axes = axes.flatten()
axes[-1].remove()
for column, ax in zip(numerical_columns, axes):
    ax.hist(x=data[column], bins=N, color='#2f9599', alpha=0.3)
    # add legends
    ax.set_title(f'Distribution of {column}')
plt.rcParams['figure.figsize'] = [12, 5] # default = [6.0, 4.0]
data[numerical_columns].describe()
X, y = data.drop('n_bike', axis=1), data['n_bike']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')
To get an idea of how important the preprocessing is, I think it is fun to create a train-test split of the dataset without preprocessing. Of course, to be fair, I will at least one-hot encode the categorical variables, remove the non-functioning days, and convert date to a timestamp so it can be treated as a time series.
__data = pd.read_csv("SeoulBikeData.csv")
__data.columns = shorter_column_name
__data['func'] = (__data['func'] == "Yes").astype(int)
__data = __data[__data.func == 1].drop('func', axis=1)
__data['holiday'] = (__data['holiday'] == "Holiday").astype(int)
__data['date'] = pd.to_datetime(__data['date'], format="%d/%m/%Y").astype('int64') // 10 ** 9
__data = pd.get_dummies(__data, prefix=['season'], columns=['season'])
__data
__X, __y = __data.drop('n_bike', axis=1), __data['n_bike']
__X_train, __X_test, __y_train, __y_test = train_test_split(__X, __y, test_size=0.30, random_state=42)
print(f'__X_train shape: {__X_train.shape}')
print(f'__y_train shape: {__y_train.shape}')
print(f'__X_test shape: {__X_test.shape}')
print(f'__y_test shape: {__y_test.shape}')
plt.scatter(__X_train.date, __y_train, color='#8800ff', alpha=0.05, label='Training sample')
plt.scatter(__X_test.date, __y_test, color='#2f9599', alpha=0.1, label='Testing sample')
plt.legend()
plt.title('Distribution of sample over time')
plt.xlabel('timestamp')
plt.ylabel('n_bike')
pass
As I already said, we have a regression problem, so I need to test some regression models. To do so, let's display the models sklearn gives us.
all_regs = np.array([name for name, RegressorClass in all_estimators(type_filter='regressor')])
for i in range(0, len(all_regs), 3):
    j = min(i + 3, len(all_regs))
    print('| '.join([f'{reg:30}' for reg in all_regs[i:j]]))
There are A LOT of models. I decide to pick a few of them:
IMPORTANT: For each model (except the Baseline, of course), the cells will be exactly the same, save for the first one where the grid search hyperparameters are defined. This gives a homogeneous way to compare these models, so read the code once.
Before diving into model grid searching and fitting, let's talk about my objectives. For each model :
To compute the RMSE, I need a custom method that returns the loss value and unfits the transformed target if needed.
def target_unfit(y):
    y = (y * STD_LOG_N_BIKE + MEAN_LOG_N_BIKE) ** 2
    return y

def RMSE(pred, y, unfit=True):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    rmse_error = np.sqrt((error ** 2).mean())
    return rmse_error
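As a sanity check, the unfit step should exactly invert the forward transform (sqrt, then standardize). A minimal round-trip sketch, using hypothetical constants in place of the fitted MEAN_LOG_N_BIKE and STD_LOG_N_BIKE:

```python
import numpy as np

# hypothetical constants standing in for MEAN_LOG_N_BIKE / STD_LOG_N_BIKE
MEAN_SQRT, STD_SQRT = 24.0, 12.0

def target_fit(n_bike):
    # forward transform: sqrt, then standardize
    return (np.sqrt(n_bike) - MEAN_SQRT) / STD_SQRT

def target_unfit(y):
    # inverse transform: de-standardize, then square
    return (y * STD_SQRT + MEAN_SQRT) ** 2

n_bike = np.array([0.0, 100.0, 900.0, 2500.0])
print(np.allclose(target_unfit(target_fit(n_bike)), n_bike))  # True
```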
Then I need a method to compute the relative error distribution, to get an idea of the error made by the models. (I try this method on random synthetic data.)
N = 5000
x = np.random.random(N)
y = x ** 2 * np.random.random(N) + x * np.random.random(N)
def relative_error_distrib_plot(pred, y, model_name, unfit=True, color='#8800ff', alpha=0.5, label='Sample', ax=None):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    if ax is not None:
        ax.hist(error, bins=len(y)//100, color=color, alpha=alpha, label=label)
        ax.legend()
        ax.set_title(f'Relative error distribution with {model_name}')
        ax.set_xlabel('Relative error')
        ax.set_ylabel('Count')
    else:
        plt.hist(error, bins=len(y)//100, color=color, alpha=alpha, label=label)
        plt.legend()
        plt.title(f'Relative error distribution with {model_name}')
        plt.xlabel('Relative error')
        plt.ylabel('Count')
relative_error_distrib_plot(x, y, 'no model', unfit=False, label='False sample')
And I also need a method to plot the same relative error against the target values (also tested on the synthetic random data).
def relative_error_plot(pred, y, model_name, unfit=True, color='#8800ff', alpha=0.05, label='Sample', ax=None):
    if unfit:
        pred = target_unfit(pred)
        y = target_unfit(y)
    error = pred - y
    # add regression lines
    m, b = np.polyfit(y, error, 1)
    if ax is not None:
        ax.scatter(y, error, color=color, alpha=alpha, label=label)
        ax.plot(y, m * y + b, color='#fff', linewidth=2)
        ax.plot(y, m * y + b, color=color, label=f'Linear regression of {label}')
        ax.legend()
        ax.set_title(f'Relative error with {model_name}')
        ax.set_xlabel('n_bike')
        ax.set_ylabel('Relative error')
    else:
        plt.scatter(y, error, color=color, alpha=alpha, label=label)
        plt.plot(y, m * y + b, color='#fff', linewidth=3)
        plt.plot(y, m * y + b, color=color, label=f'Linear regression of {label}')
        plt.legend()
        plt.title(f'Relative error with {model_name}')
        plt.xlabel('n_bike')
        plt.ylabel('Relative error')
relative_error_plot(x, y, 'no model', unfit=False, label='False sample')
Last but not least, I need a method to perform the grid search homogeneously, to be able to compare the following models honestly. It performs a cross-validation shuffling the training set into 4 splits. A random seed is also set to make my experiments reproducible.
def grid_search(X, y, model, param_grid={}):
    rkwargs = {'random_state': 42}
    try:
        model(**rkwargs)
    except TypeError:
        # the estimator does not accept a random_state argument
        rkwargs = {}
    grid = GridSearchCV(
        model(**rkwargs),
        param_grid=param_grid,
        cv=ShuffleSplit(n_splits=4, random_state=42),
        verbose=0,
        n_jobs=-1
    )
    grid.fit(X, y)
    return grid
My first model is not an sklearn model. It is always a good idea to have a baseline of the error without any model. In other words, I take the mean target of each training set and compute the error on the respective test sets.
pred = y_train.mean()
rmse = RMSE(pred, y_test, unfit=True)
__pred = __y_train.mean()
__rmse = RMSE(__pred, __y_test, unfit=False)
RESULTS = pd.DataFrame.from_dict(data={'model': ['Baseline'], 'is_preprocess':[1], 'RMSE':[rmse]})
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': 'Baseline', 'is_preprocess': 0, 'RMSE': __rmse}])], ignore_index=True)
RESULTS
We can see our preprocessing on the target (sqrt) gives us a poorer baseline. Nothing to worry about; on the contrary, it is logical to get a worse prediction, as the mean is driven by the extremes. With our sqrt transform, taking the mean in the transformed space and squaring it back underestimates the high extremes.
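This underestimation can be checked numerically: for any non-degenerate distribution, the square of the mean of the square roots lies strictly below the raw mean (Jensen's inequality). A small sketch on synthetic skewed counts (the gamma parameters are arbitrary, chosen only to mimic heavy-tailed hourly rentals):

```python
import numpy as np

rng = np.random.default_rng(42)
# skewed synthetic counts with high extremes
counts = rng.gamma(shape=2.0, scale=300.0, size=10_000)

raw_mean = counts.mean()                        # baseline fitted in raw space
back_transformed = np.sqrt(counts).mean() ** 2  # mean fitted in sqrt space, squared back

print(back_transformed < raw_mean)  # True: the sqrt-space baseline underestimates
```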
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, 'Baseline', label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, 'Baseline', unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, 'Baseline', label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, 'Baseline', unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
Of course, we can see the predictions of the Baseline model are better on the dataset without transformation. Moreover, we know the Baseline model on the transformed dataset underestimates the number of bikes rented.
Model Name: LinearRegression
Description: Ordinary Least Squares algorithm
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 5 / 5
When to Use: Highly interpretable, no introduced bias
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- Little variance between output labels
$\qquad$- All of the input features are not only independent but also are not correlated.
Advantages:
$\qquad$- Easy to interpret results
$\qquad$- Low complexity level
Disadvantages:
$\qquad$- At risk of multicolinearity if input features are correlated
$\qquad$- Small errors/outliers in target values can drastically impact model
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
fit_intercept: bool, default=True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
normalize: bool, default=False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.
model = LinearRegression
model_name = type(model()).__name__
param_grid = {
    'fit_intercept': [True, False],
    'normalize': [True, False]
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
Here we have our first proof of the transformation's efficiency. For this simple linear model we get a clearly better prediction with the transformed dataset, which proves the new features are individually meaningful compared to the raw ones. So we can try the ElasticNet, which can be useful when there is some correlation between input features and to avoid overfitting.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: ElasticNet
Description: Ordinary Least Squares with both an L1 and L2 regularization term. The weights of the L1 vs. L2 regularization terms are controlled by an l1_ratio parameter.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 3 / 5
When to Use: Blend Ridge and Lasso
When to Use Expanded:
$\qquad$- Data consists of few outliers
$\qquad$- May be some correlation between input features
$\qquad$- Avoid overfitting
$\qquad$- Feature selection
Advantages:
$\qquad$- Incorporate the feature selection abilities of Lasso with the regularization abilities of Ridge.
Disadvantages:
$\qquad$- In lowering variance, incorporates a degree of bias into the model.
$\qquad$- Can be difficult to tune alpha to attain a desirable balance between OLS and regularization terms
$\qquad$- Higher computational cost than Ridge or Lasso
Sklearn Package: linear
Required Args: None
Helpful Args: alpha (controls strength of regularization terms) l1_ration (controls ratio between L1 and L2 regularization terms)
Variations: ElasticNetCV MultiTaskElasticNet
Grid Search:
alpha: float, default=1.0
Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
l1_ratio: float, default=0.5
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
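To see what l1_ratio does in practice, here is a quick sketch (synthetic data and an arbitrary alpha, not this notebook's dataset) counting how many coefficients are driven exactly to zero as the penalty moves from the mostly-L2 end toward the pure-L1 (Lasso) end:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

n_zero = {}
for l1_ratio in (0.01, 0.5, 1.0):  # sklearn discourages l1_ratio=0 with coordinate descent
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10_000).fit(X, y)
    n_zero[l1_ratio] = int((model.coef_ == 0).sum())

print(n_zero)  # the coefficient vector gets sparser as l1_ratio grows
```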
model = ElasticNet
model_name = type(model()).__name__
param_grid = {
    'alpha': [0.1, 0.25, 0.5, 0.75, 0.9, 1],
    'l1_ratio': [0, 0.2, 0.5, 0.7, 1],
    'max_iter': [10000]
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
In this case, the model on the untransformed dataset is better. It is because that data contains fewer outliers compared to the transformed dataset with its high number of features. I am tempted to try the HuberRegressor, which is good for quick analyses of data ignoring outliers, even if it can break down with large numbers of input features.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: HuberRegressor
Description: A linear model designed to deal with outliers in the data and/or corrupted data. Does not ignore the outliers, but rather gives them a lower weight.
Prevents Overfitting: no
Handles Outliers: yes
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: no
Interpretability Score: 4 / 5
When to Use: Outliers and want quickest algorithm
When to Use Expanded:
$\qquad$- Want quick analyses of data ignoring outliers
Advantages:
$\qquad$- Faster than RANSAC and TheilSen (as long as the number of samples is not too large)
$\qquad$- Does not completely ignore data points it deems as outliers
Disadvantages:
$\qquad$- Break down with large numbers of input features
Sklearn Package: linear
Required Args: None
Helpful Args: max_iter (maximum number of iterations to perform)
Variations: None
Grid Search:
epsilon: float, greater than 1.0, default=1.35
The parameter epsilon controls the number of samples that should be classified as outliers. The smaller the epsilon, the more robust it is to outliers.
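A quick illustration of that robustness on synthetic data (the slope of 3 and the injected outliers are arbitrary choices): OLS is dragged by a handful of corrupted targets at high x, while the Huber loss downweights them.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, size=200)

# corrupt the 10 samples with the largest x values
idx = np.argsort(X.ravel())[-10:]
y[idx] += 100.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(epsilon=1.35).fit(X, y)

# Huber's slope stays close to the true value of 3; OLS is pulled upward
print(ols.coef_[0], huber.coef_[0])
```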
model = HuberRegressor
model_name = type(model()).__name__
param_grid = {
    'epsilon': [1, 1.2, 1.35, 1.5, 2, 3, 5, 10],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
Interestingly, the HuberRegressor does not work on the untransformed dataset. It seems to underfit the data by completely ignoring the outliers. I have to make a compromise between ElasticNet and HuberRegressor, so I will try the Bayesian Ridge.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: BayesianRidge
Description: Similar to Ridge but the regularization parameter is tuned to fit the data during the training process.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: no
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 / 5
When to Use: Ridge but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Ridge, but willing to sacrifice interpretability for time saved not having to test different regularization weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
Disadvantages:
$\qquad$- Less interpretable results
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
n_iter: int, default=300
Maximum number of iterations. Should be greater than or equal to 1.
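The "adaptive regularization" point can be seen directly in sklearn: after fitting, BayesianRidge exposes the precisions it estimated from the data (alpha_ for the noise, lambda_ for the weights), so no regularization constant has to be supplied by hand. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

model = BayesianRidge().fit(X, y)
# both precisions are learned during training rather than set as hyperparameters
print(model.alpha_, model.lambda_)
```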
model = BayesianRidge
model_name = type(model()).__name__
param_grid = {
    'n_iter': [5, 20, 50, 100, 200, 300],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
Here we are! The BayesianRidge shows us one thing. It can be hard to be sure of it, but let's summarize my thoughts:
So I think no regularization is needed, as the LinearRegression is the best model so far and the other models tried are variants of it with regularization. I am not sure; trying another variant may lead to the same conclusion. Let's go for the ARDRegression.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: ARDRegression
Description: BayesianRidge with sparser weight values. Almost like a version of BayesianLasso.
Prevents Overfitting: yes
Handles Outliers: no
Handles several features: yes
Adaptive Regularization: yes
Large Dataset: no
Non linear: no
Interpretability Score: 2 /5
When to Use: Lasso but don't want to set regularization constant
When to Use Expanded:
$\qquad$- Are seeking results similar to Lasso, but willing to sacrifice interpretability for time saved not having to test different regularization term weights
Advantages:
$\qquad$- No need to tune alpha value
$\qquad$- Adapts well to data on hand
$\qquad$- Reduces weight of unimportant features
Disadvantages:
$\qquad$- Less interpretable results
$\qquad$- Computationally expensive (can't handle very large datasets)
Sklearn Package: linear
Required Args: None
Helpful Args: None
Variations: None
Grid Search:
n_iter: int, default=300
Maximum number of iterations.
model = ARDRegression
model_name = type(model()).__name__
param_grid = {
    'n_iter': [5, 20, 50, 100, 200, 300],
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
And yes, I reach the same conclusion I had for the BayesianRidge. It is time to stop trying linear models and move on to other regressors. I will try the KNeighborsRegressor.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: KNeighborsRegressor
Description: Creates a model based off of the k nearest neighbors at any given point. Where k is an input argument.
Prevents Overfitting: no
Handles Outliers: no
Handles several features: no
Adaptive Regularization: no
Large Dataset: no
Non linear: yes
Interpretability Score: 5 / 5
When to Use: Nonlinear data, interpretability is important, unimportant features
When to Use Expanded:
$\qquad$- When you are unsure of the structure of your data and want a model that will fit well
$\qquad$- Not concerned with overfitting
$\qquad$- Interpretability is important
Advantages:
$\qquad$- Fits very well to data of various structures
$\qquad$- More interpretable than other nonlinear models
Disadvantages:
$\qquad$- Extremely impacted by outliers and corrupt data
$\qquad$- Need several more samples than features for quality results
$\qquad$- Difficulty dealing with large numbers of features
Sklearn Package: neighbors
Required Args: None
Helpful Args: n_neighbors (number of neighbors to use)
Variations: None
Grid Search:
n_neighbors: int, default=5
Number of neighbors to use by default for kneighbors queries.
weights: str or callable, default='uniform'
Weight function used in prediction. Possible values: 'uniform': uniform weights, all points in each neighborhood are weighted equally; 'distance': weight points by the inverse of their distance, in which case closer neighbors of a query point have a greater influence than neighbors which are further away.
p: int, default=2
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
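The role of p can be checked with a tiny hand-rolled helper (a hypothetical re-implementation of the Minkowski distance for illustration, not sklearn's internal one):

```python
import numpy as np

def minkowski(u, v, p):
    # Minkowski distance between two vectors: (sum |u_i - v_i|^p)^(1/p)
    return float((np.abs(u - v) ** p).sum() ** (1.0 / p))

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0 -> Manhattan (l1)
print(minkowski(a, b, p=2))  # 5.0 -> Euclidean (l2)
```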
model = KNeighborsRegressor
model_name = type(model()).__name__
param_grid = {
    'n_neighbors': [3, 5, 10, 15, 20],
    'weights': ['uniform', 'distance'],
    'algorithm': ['ball_tree', 'kd_tree'],
    'p': [1, 2]  # the tree-based algorithms require p >= 1
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
We have really poor results on the transformed dataset. This can be easily explained: the KNeighborsRegressor is a distance-based model, so it struggles when the feature space is large and dominated by one-hot encoded columns. A RandomForestRegressor can deal with these disadvantages.
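One plausible source of KNN's difficulty on the transformed data is the one-hot encoded feature space. A minimal sketch (my own illustration, not from the original analysis) of how dummy columns dominate Euclidean distance:

```python
import numpy as np

# Hypothetical setup: ~80 one-hot dummy columns next to standardized
# continuous columns, roughly like the transformed dataset.
n_dummies = 80
a = np.zeros(n_dummies + 1)

# Case 1: samples differ by one standard deviation in a single
# standardized continuous feature.
b_cont = a.copy()
b_cont[0] = 1.0
d_cont = np.linalg.norm(a - b_cont)

# Case 2: samples differ only in two dummy columns (e.g. a different
# month: one dummy switched off, another switched on).
b_dummy = a.copy()
b_dummy[1], b_dummy[2] = 1.0, 1.0
d_dummy = np.linalg.norm(a - b_dummy)

print(d_cont, d_dummy)  # 1.0 vs ~1.414: the minor dummy flip moves the point further
```

A semantically minor dummy flip outweighs a full standard-deviation change in a continuous feature, which dilutes the neighborhoods KNN relies on.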
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
Model Name: RandomForestRegressor
Description: A random forest is a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Prevents Overfitting: yes
Handles Outliers: yes
Handles several features: yes
Adaptive Regularization: no
Large Dataset: yes
Non linear: yes
Interpretability Score: 3 / 5
When to Use: Nonlinear data groups in buckets
When to Use Expanded:
$\qquad$- Data is not linear and is composed more of "buckets"
$\qquad$- Number of samples > number of features
$\qquad$- There are dependent features in the input data. DTR handles these correlations well.
Advantages:
$\qquad$- Can export tree structure to see which features the tree is splitting on
$\qquad$- Handles sparse and correlated data well
$\qquad$- Able to tune the model to help with overfitting problem
Disadvantages:
$\qquad$- Prediction accuracy on complex problems is usually inferior to gradient-boosted trees.
$\qquad$- A forest is less interpretable than a single decision tree.
Sklearn Package: ensemble
Required Args: None
Helpful Args: criterion and max_depth
Variations: gradient-boosted trees
Grid Search:
n_estimators: int, default=100
The number of trees in the forest.
criterion: str, default='mse'
The function to measure the quality of a split. Supported criteria are 'mse' for the mean squared error, which is equal to variance reduction as feature selection criterion, and 'mae' for the mean absolute error.
max_depth: int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
max_features: str, int or float, default='auto'
The number of features to consider when looking for the best split.
model = RandomForestRegressor
model_name = type(model()).__name__
param_grid = {
'max_depth': [None, 10, 50, 100],
'max_features' : ['auto', 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
'n_estimators': [10, 50, 100],
'criterion': ['mse', 'mae']  # renamed to 'squared_error' / 'absolute_error' in scikit-learn 1.2
}
useful_columns = ['param_' + k for k in param_grid.keys()] + ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
%%time
grid = grid_search(X_train, y_train, model, param_grid)
pd.DataFrame(grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
%%time
__grid = grid_search(__X_train, __y_train, model, param_grid)
pd.DataFrame(__grid.cv_results_)[useful_columns].sort_values('rank_test_score').head()
pred = grid.best_estimator_.predict(X_test)
__pred = __grid.best_estimator_.predict(__X_test)
fig, axes = plt.subplots(1, 2)
ax=axes[0]
relative_error_distrib_plot(pred, y_test, model_name, label='Test set with transformation', ax=ax)
relative_error_distrib_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test set without transformation', ax=ax)
ax=axes[1]
ax.axhline(y=0, xmin=0, xmax=1, linestyle='--', color='#999', label='Perfect fit line')
relative_error_plot(pred, y_test, model_name, label='Test samples with transformation', ax=ax)
relative_error_plot(__pred, __y_test, model_name, unfit=False, color='#2f9599', label='Test samples without transformation', ax=ax)
pass
The RandomForestRegressor is the best model I have tried. It reaches the best results for both the transformed and untransformed datasets, and performs slightly better on the transformed one. I will stop the model enumeration here and compare the final results.
rmse = RMSE(pred, y_test, unfit=True)
__rmse = RMSE(__pred, __y_test, unfit=False)
params = grid.best_params_
__params = __grid.best_params_
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 1, 'RMSE': rmse, **params}])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
RESULTS = pd.concat([RESULTS, pd.DataFrame([{'model': model_name, 'is_preprocess': 0, 'RMSE': __rmse, **__params}])], ignore_index=True)
RESULTS
RESULTS.to_csv('model_results.csv', index=False)
results = pd.read_csv('model_results_final.csv')
results
As I ran the same cells to have a homogeneous way of comparing the models, the comparison is straightforward. I simply pivot the results by dataset and sort the RMSE values in ascending order according to the transformed dataset.
results_pivot = results.pivot(index="model", columns="is_preprocess", values="RMSE")  # keyword arguments are required in pandas 2.0+
results_pivot = results_pivot.reindex(results_pivot.sort_values(by=1, ascending=True).index)
results_pivot.plot(kind='bar', color=['#2f9599', '#80f'], alpha=0.5, edgecolor='#fff')
plt.title('Model Comparison')
plt.xlabel('Models')
plt.ylabel('RMSE on test sets')
results_pivot
Here we reach the same conclusions: the KNeighborsRegressor is a poor model to predict the number of rented bikes, the linear models are better, preferably without any regularization, and the RandomForestRegressor is clearly the best model to choose for both datasets, even if it works better on the transformed one.
It can be interesting to see which features matter most in our case to predict the number of bikes rented for a given hour. Of course, as we have 89 features, I will limit the bar plot to the top 30 features, considering the other ones are not important to our result.
rf_model = grid.best_estimator_
feature_importances = {
feature: importance
for feature, importance in zip(X_train.columns, rf_model.feature_importances_)
}
feature_importances = {k: v for k, v in sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)[:30]}
plt.bar(range(len(feature_importances)), list(feature_importances.values()), align='center', color='#8800ff', alpha=0.5)
plt.xticks(range(len(feature_importances)), list(feature_importances.keys()), rotation=90)
plt.title('Feature Importance (top 30 only)')
plt.xlabel('Features')
plt.ylabel('Importance')
pass
Not surprisingly, the meteorological features are globally the most important ones. The working-hour features follow, with 18h for instance.
We can conclude from this plot:
- There are more bikes rented when the weather is good and the temperature is high
- There are more bikes rented during peak hours
I have chosen my model, so I now need to create the methods required for it to be deployed.
First, I save my RandomForestRegressor model:
pkl_filename = "rf_model.pkl"
with open(pkl_filename, 'wb') as f:
pickle.dump(rf_model, f)
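As a sanity check on the serialization step, a pickled model should reproduce its predictions exactly after reloading. A self-contained sketch on a toy model (toy_model and the random data are illustrative only, not the notebook's rf_model):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data and model, only to illustrate the pickle round-trip.
rng = np.random.default_rng(42)
X_toy = rng.random((100, 5))
y_toy = rng.random(100)
toy_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Dump then reload the model, exactly as done with rf_model above.
with open("toy_model.pkl", "wb") as f:
    pickle.dump(toy_model, f)
with open("toy_model.pkl", "rb") as f:
    reloaded = pickle.load(f)

# The reloaded model must reproduce the original predictions exactly.
assert np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy))
```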
On the client side, the user will be asked to create a matrix of features to predict. Each row is a list of features, ordered as follows:
| Feature | Format |
|---|---|
| Date | dd/mm/yyyy |
| Hour | int |
| Temperature | °C |
| Humidity | % |
| Wind speed | m/s |
| Visibility | 10m |
| Dew point temperature | °C |
| Solar Radiation | MJ/m2 |
| Rainfall | mm |
| Snowfall | cm |
| Seasons | {"Winter", "Autumn", "Spring", "Summer"} |
| Holiday | {"Holiday", "No Holiday"} |
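For illustration, a hypothetical single row matching this order (all values below are made up, not taken from the dataset):

```python
# A hypothetical single-row input in the expected feature order
# (all values are illustrative).
sample_row = [
    "01/12/2017",  # Date (dd/mm/yyyy)
    18,            # Hour (int)
    -5.2,          # Temperature (°C)
    37,            # Humidity (%)
    2.2,           # Wind speed (m/s)
    2000,          # Visibility (10m)
    -17.6,         # Dew point temperature (°C)
    0.0,           # Solar Radiation (MJ/m2)
    0.0,           # Rainfall (mm)
    0.0,           # Snowfall (cm)
    "Winter",      # Seasons
    "No Holiday",  # Holiday
]
values = [sample_row]  # the client sends a matrix: one such list per sample
```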
I recreate the preprocessing function :
def dummies_from_file(df, file_name):
_df = df.copy()
with open(file_name) as f:
dummies = json.load(f)
for column, values in dummies.items():
for value in values:
_df[f'{column}_{value}'] = (_df[column].astype(str) == value).astype(int)
_df = _df.drop(column, axis=1)
return _df
def norm_from_file(df, file_name):
_df = df.copy()
with open(file_name) as f:
transformations = json.load(f)
for column, t in transformations.items():
if column == 'n_bike':
continue
elif t['function'] == 'standardization':
_df[column] = (_df[column] - t['mean']) / t['std']
elif t['function'] == 'normalization':
_df[column] = _df[column] / t['maxi']
return _df
def preprocess(values, return_df=False):
_df = pd.DataFrame(values, columns = [
'date', # Date (dd/mm/yyyy)
'hour', # Hour (int)
'temp', # Temperature (°C)
'hum', # Humidity (%)
'wind', # Wind speed (m/s)
'visb', # Visibility (10m)
'dew', # Dew point temperature (°C)
'solar', # Solar Radiation (MJ/m2)
'rain', # Rainfall(mm)
'snow', # Snowfall (cm)
'season', # Seasons ({"Winter", "Autumn", "Spring", "Summer"})
'holiday', # Holiday ({"Holiday", "No Holiday"})
])
# set boolean string to boolean
_df['holiday'] = (_df['holiday'] == "Holiday").astype(int)
# get datetime and its arguments
_df['date'] = pd.to_datetime(_df['date'], format="%d/%m/%Y")
_df['year'] = _df['date'].dt.year
_df['month'] = _df['date'].dt.month_name()
_df['day'] = _df['date'].dt.day
_df['week_day'] = _df['date'].dt.day_name()
_df['working_day'] = (_df['date'].dt.dayofweek < 5).astype(int)  # np.int was removed in recent NumPy versions
_df = _df.drop('date', axis=1)
# meteorological arguments
_df['wind'] = np.log(_df['wind'].astype(float) + 1)
_df['solar'] = np.sqrt(_df['solar'].astype(float))
_df['rain'] = _df.rolling(2, min_periods=1)['rain'].mean()
_df['dryness'] = 1 / (_df.rain + 1)
_df = _df.drop('rain', axis=1)
_df['snow'] = _df.rolling(8, min_periods=1)['snow'].mean()
_df['snowing'] = (_df.snow > 0).astype(int)  # np.int was removed in recent NumPy versions
_df = _df.drop('snow', axis=1)
_df['invisb'] = _df.visb.max() - _df.visb
_df = _df.drop('visb', axis=1)
# complete preprocessing
_df = dummies_from_file(_df, 'dummies.json')
_df = norm_from_file(_df, 'transformations.json')
if not return_df:
_df = _df.values
return _df
To test whether my functions are correctly implemented, I take the raw values from the CSV file and run the preprocessing on this matrix.
df = pd.read_csv("SeoulBikeData.csv")
df = df.drop(['Rented Bike Count', 'Functioning Day'], axis=1)
values = df.values
values
df = preprocess(values, return_df=True)
values = preprocess(values, return_df=False)
Then I compare the X I had in part 5.4 (Train-test split) with the matrix of values obtained from the raw format:
if sum([X_col != df_col for X_col, df_col in zip(X.columns, df.columns)]):
print('Columns are not in the same order')
else:
print('Columns are in the same order')
for col in X.columns:
    err = (X[col] - df[col]).mean()
    if err != 0:
        print(f'Error for column {col}: {err}')
Columns are in the same order and the values are the same. All the features match, except for dryness, where the small difference can be caused by floating-point approximation. Everything is clean.
Then remains the server side, where the model is deployed. From the client side, the user will be asked to request the API (hosted on localhost for now) on the '/predict' route with the JSON matrix of features to predict.
Here the python code from the deployed_model.py file:
# Import libraries
import json
import pickle
import numpy as np
from flask import Flask, request, jsonify
app = Flask(__name__)
# Load the model
with open('rf_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('transformations.json') as f:
transformations = json.load(f)
mean, std = transformations['n_bike']['mean'], transformations['n_bike']['std']
@app.route('/predict',methods=['POST'])
def predict():
# Get the data from the POST request.
data = request.get_json(force=True)
# Make prediction using model loaded from disk as per the data.
inputs = np.array(data['inputs'])
prediction = (model.predict(inputs) * std + mean) ** 2
return jsonify(list(prediction))
if __name__ == '__main__':
app.run(port=5000, debug=True)
We can run this server from the terminal with python deployed_model.py or python3 deployed_model.py, depending on your OS.
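The prediction line (model.predict(inputs) * std + mean) ** 2 inverts the target transformation. Assuming the target was standardized after a square-root transform (my reading of the preprocessing, with made-up counts below), a quick numerical round-trip check:

```python
import numpy as np

# Illustrative bike counts (made-up values).
n_bike = np.array([0.0, 100.0, 700.0])

# Forward transform: square root, then standardization (assumed to be
# what the preprocessing applied to the target).
t = np.sqrt(n_bike)
mean, std = t.mean(), t.std()
z = (t - mean) / std

# Inverse transform, as written in the /predict route.
recovered = (z * std + mean) ** 2

assert np.allclose(recovered, n_bike)
```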
As a sanity check, I first verify that the request runs without any issue.
import requests
url = 'http://localhost:5000/predict'
def serialize(df):
return [[value for value in row] for row in df.values]
r = requests.post(url, json={'inputs': serialize(df)})
The model is running and answers the request.
pred = np.array(r.json())
pred_true = grid.best_estimator_.predict(df.values)
err = pred - target_unfit(pred_true)
err.mean()
%%capture
!jupyter nbconvert Axel_THEVENOT_Python_For_Data_Analysis.ipynb --to HTML --template classic
!jupyter nbconvert Axel_THEVENOT_Python_For_Data_Analysis.ipynb --to HTML --template classic --no-input --output "Axel_THEVENOT_Python_For_Data_Analysis_report.html"